word2vec model


Evaluating Embedding Frameworks for Scientific Domain

arXiv.org Artificial Intelligence

Finding an optimal word representation algorithm is particularly important for domain-specific data, as the same word can have different meanings, and hence different representations, depending on the domain and context. While Generative AI and transformer architectures do a great job of generating contextualized embeddings for any given word, they are quite time- and compute-intensive, especially if we were to pre-train such a model from scratch. In this work, we focus on the scientific domain and on finding the optimal word representation algorithm, along with the tokenization method, that could be used to represent words in the scientific domain. The goal of this research is twofold: 1) finding the optimal word representation and tokenization methods that can be used in downstream scientific-domain NLP tasks, and 2) building a comprehensive evaluation suite that could be used to evaluate various word representation and tokenization algorithms (even as new ones are introduced) in the scientific domain. To this end, we build an evaluation suite consisting of several downstream tasks and relevant datasets for each task. Furthermore, we use the constructed evaluation suite to test various word representation and tokenization algorithms.


SemanticZ at SemEval-2016 Task 3: Ranking Relevant Answers in Community Question Answering Using Semantic Similarity Based on Fine-tuned Word Embeddings

arXiv.org Artificial Intelligence

We describe our system for finding good answers in a community forum, as defined in SemEval-2016, Task 3 on Community Question Answering. Our approach relies on several semantic similarity features based on fine-tuned word embeddings and topic similarities. In the main Subtask C, our primary submission was ranked third, with a MAP of 51.68 and accuracy of 69.94. In Subtask A, our primary submission was also third, with a MAP of 77.58 and accuracy of 73.39.
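
As a rough illustration of an embedding-based semantic similarity feature, the sketch below averages word vectors for a question and for each candidate answer, then ranks the answers by cosine similarity. It is not the system described in the abstract: the `EMBEDDINGS` table, the function names, and the whitespace tokenization are all hypothetical stand-ins, and a real system would load fine-tuned word2vec vectors and combine several features.

```python
import math

# Hypothetical toy embeddings; a real system would load fine-tuned word2vec vectors.
EMBEDDINGS = {
    "visa": [0.9, 0.1, 0.0],
    "renew": [0.8, 0.2, 0.1],
    "passport": [0.7, 0.3, 0.0],
    "office": [0.6, 0.2, 0.2],
    "restaurant": [0.0, 0.9, 0.3],
    "food": [0.1, 0.8, 0.4],
}
DIM = 3


def average_vector(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_answers(question, answers):
    """Return (score, answer) pairs sorted from most to least similar."""
    q_vec = average_vector(question.lower().split())
    scored = [(cosine(q_vec, average_vector(a.lower().split())), a) for a in answers]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_answers("renew visa", ["good food restaurant", "passport office"])
```

With these toy vectors, `ranked` places "passport office" ahead of "good food restaurant", since its averaged vector points in nearly the same direction as the question's.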


Automatic Detection of Satire in Bangla Documents: A CNN Approach Based on Hybrid Feature Extraction Model

arXiv.org Artificial Intelligence

The wide spread of satirical news in online communities is an ongoing trend. Satire is so inherently ambiguous that it is sometimes hard even for humans to tell whether a text is actually satire. As a result, research interest in this field has grown. The purpose of this research is to detect Bangla satirical news spread in online news portals as well as on social media. In this paper, we propose a hybrid technique for extracting features from text documents that combines Word2Vec and TF-IDF. Using our proposed feature extraction technique with a standard CNN architecture, we could detect whether a Bangla text document is satire or not with an accuracy of more than 96%. Satire can be considered a literary form that involves a delicate balance between criticism and humor.
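
The abstract does not say how the two feature types are combined. One common way to mix them, shown here purely as an assumption, is to weight each word vector by its TF-IDF score before averaging. The toy `WORD_VECTORS` table, the English tokens, the smoothed IDF formula, and the function names are all hypothetical, and the CNN classifier that would consume these features is not shown.

```python
import math
from collections import Counter

# Hypothetical toy vectors standing in for trained Word2Vec embeddings.
WORD_VECTORS = {
    "the": [1.0, 0.0],
    "minister": [1.0, 0.0],
    "joked": [0.0, 1.0],
    "spoke": [0.0, 1.0],
}
DIM = 2


def inverse_document_frequency(corpus):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_weighted_vector(doc, idf):
    """Average the word vectors of a document, weighting each by TF * IDF."""
    counts = Counter(doc)
    total = [0.0] * DIM
    weight_sum = 0.0
    for tok, count in counts.items():
        if tok not in WORD_VECTORS or tok not in idf:
            continue  # skip tokens without a vector or an IDF score
        weight = (count / len(doc)) * idf[tok]
        for i, value in enumerate(WORD_VECTORS[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


corpus = [["the", "minister", "joked"], ["the", "minister", "spoke"]]
idf = inverse_document_frequency(corpus)
features = tfidf_weighted_vector(corpus[0], idf)
```

In this toy corpus, "joked" occurs in only one document, so it receives a higher IDF than "the" and pulls the document vector further toward its own direction than a plain average would.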